{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![LOGO](../img/MODIN_ver2_hrz.png)\n",
    "\n",
    "<center><h2>Scale your pandas workflows by changing one line of code</h2>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 5: Executing on a cluster environment\n",
    "\n",
    "**GOAL**: Learn how to connect Modin to a Ray cluster and run pandas queries on a cluster.\n",
    "\n",
    "**NOTE**: Exercise 4 must be completed first, this exercise relies on the cluster created in Exercise 4."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Modin performance scales as the number of nodes and cores increases. In this exercise, we will reproduce the data from the plot below.\n",
    "\n",
    "![ClusterPerf](../img/modin_cluster_perf.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Don't change this cell!\n",
    "import ray\n",
    "ray.init(address=\"auto\")\n",
    "import modin.pandas as pd\n",
    "from modin.config import NPartitions\n",
    "if NPartitions.get() != 768:\n",
    "    print(\"This notebook was designed and tested for an 8 node Ray cluster. \"\n",
    "          \"Proceed at your own risk!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!du -h big_yellow.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "df = pd.read_csv(\"big_yellow.csv\", quoting=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "count_result = df.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print\n",
    "count_result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "groupby_result = df.groupby(\"passenger_count\").count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print\n",
    "groupby_result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "apply_result = df.applymap(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print\n",
    "apply_result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ray.shutdown()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Shutting down the cluster\n",
    "\n",
    "**You may have to change the path below**. If this does not work, log in to your \n",
    "\n",
    "Now that we have finished computation, we can shut down the cluster with `ray down`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ray down modin-cluster.yaml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### This ends the cluster exercise"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
